Heterogeneous Graph Learning for Acoustic Event Classification
Heterogeneous graphs provide a compact, efficient, and scalable way to model
data involving multiple disparate modalities. This makes modeling audiovisual
data using heterogeneous graphs an attractive option. However, graph structure
does not appear naturally in audiovisual data. Graphs for audiovisual data are
constructed manually, which is both difficult and sub-optimal. In this work, we
address this problem by (i) proposing a parametric graph construction strategy
for the intra-modal edges, and (ii) learning the crossmodal edges. To this end,
we develop a new model, the heterogeneous graph crossmodal network (HGCN), which
learns the crossmodal edges. Our proposed model can adapt to various spatial
and temporal scales owing to its parametric construction, while the learnable
crossmodal edges effectively connect the relevant nodes across modalities.
Experiments on a large benchmark dataset (AudioSet) show that our model is
state-of-the-art (0.53 mean average precision), outperforming transformer-based
models and other graph-based models.
Comment: arXiv admin note: text overlap with arXiv:2207.0793
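The learnable crossmodal edges can be pictured as scoring every audio-video node pair and normalizing the scores per audio node. The sketch below is only an illustration of that idea, not the paper's formulation: the feature sizes, the bilinear scoring form, and the softmax normalization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 audio nodes, 6 video nodes, 8-d node features.
A = rng.normal(size=(4, 8))   # audio node features
V = rng.normal(size=(6, 8))   # video node features
W = rng.normal(size=(8, 8))   # learnable bilinear weight (random init here)

def crossmodal_edges(A, V, W):
    """Score every audio-video node pair, then normalize per audio node
    so each row is a distribution over candidate crossmodal edges."""
    scores = A @ W @ V.T                         # (4, 6) pairwise scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

E = crossmodal_edges(A, V, W)
print(E.shape)        # (4, 6)
print(E.sum(axis=1))  # each row sums to 1
```

In a trained model, W would be learned jointly with the rest of the network so that high-weight edges connect the relevant nodes across modalities.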
Unsupervised discovery of character dictionaries in animation movies
Automatic content analysis of animation movies can enable an objective understanding of character (actor) representations and their portrayals. It can also help illuminate potential markers of unconscious biases and their impact. However, multimedia analysis of movie content has predominantly focused on live-action features. The dearth of multimedia research in this field stems from the complexity and heterogeneity in the design of animated characters, an extremely challenging problem to generalize with a single method or model. In this paper, we address the problem of automatically discovering characters in animation movies as a first step toward automatic character labeling in these media. Movie-specific character dictionaries can act as a powerful first step for subsequent content analysis at scale. We propose an unsupervised approach that requires no prior information about the characters in a movie. We first use a deep neural network-based object detector trained on natural images to identify a set of initial character candidates. These candidates are further pruned using saliency constraints and visual object tracking. A character dictionary per movie is then generated from exemplars obtained by clustering these candidates. We are able to identify both anthropomorphic and non-anthropomorphic characters in a dataset of 46 animation movies with varying composition and character design. Our results indicate high precision and recall of the automatically detected characters compared to human-annotated ground truth, demonstrating the generalizability of our approach.
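The final dictionary-building step (cluster the pruned candidates, keep one exemplar per cluster) can be illustrated with a toy sketch. The stand-in embeddings, the cluster count, and the use of plain k-means are all illustrative assumptions; the paper's actual clustering details may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in appearance embeddings for pruned candidates: three tight
# groups of points, i.e. three hypothetical "characters".
candidates = np.concatenate(
    [rng.normal(loc=c, scale=0.05, size=(10, 16)) for c in (0.0, 1.0, 2.0)]
)

def build_dictionary(X, k, iters=20):
    """Plain k-means; each cluster's exemplar is the candidate closest
    to its centroid."""
    centroids = X[:: len(X) // k][:k].copy()  # simple deterministic init
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    exemplars = [int(np.argmin(((X - c) ** 2).sum(-1))) for c in centroids]
    return labels, exemplars

labels, exemplars = build_dictionary(candidates, k=3)
print(len(exemplars))  # one exemplar per discovered character
```

The exemplar images corresponding to these indices would form the movie-specific character dictionary used for downstream labeling.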
A dataset for Audio-Visual Sound Event Detection in Movies
Audio event detection is a widely studied audio processing task, with
applications ranging from self-driving cars to healthcare. In-the-wild datasets
such as AudioSet have propelled research in this field. However, many efforts
typically involve manual annotation and verification, which is expensive to
perform at scale. Movies depict various real-life and fictional scenarios which
makes them a rich resource for mining a wide range of audio events. In this
work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds
(SAM-S). We use publicly-available closed-caption transcripts to automatically
mine over 110K audio events from 430 movies. We identify three dimensions for
categorizing audio events (sound, source, and quality) and present the steps
involved in producing a final taxonomy of 245 sounds. We discuss the choices involved in
generating the taxonomy, and also highlight the human-centered nature of sounds
in our dataset. We establish a baseline performance for audio-only sound
classification of 34.76% mean average precision and show that incorporating
visual information can further improve the performance by about 5%. Data and
code are made available for research at
https://github.com/usc-sail/mica-subtitle-aligned-movie-sound
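The caption-mining idea can be sketched with a few lines of pattern matching: subtitle files conventionally mark non-speech sounds in brackets or parentheses, so those spans can be extracted while spoken dialogue is discarded. The caption lines below and the use of bracket markers as the sole filter are simplified assumptions, not the paper's actual pipeline.

```python
import re

# Hypothetical closed-caption lines; sound descriptions appear in brackets.
captions = [
    "[door creaks]",
    "JOHN: We need to leave. Now.",
    "[thunder rumbling] [dog barking]",
    "(phone ringing)",
]

TAG = re.compile(r"[\[(]([^\])]+)[\])]")  # text inside [...] or (...)

def mine_sound_events(lines):
    """Keep caption spans in brackets/parentheses, which subtitles use
    for non-speech sounds, and drop spoken dialogue."""
    events = []
    for line in lines:
        events.extend(m.strip().lower() for m in TAG.findall(line))
    return events

print(mine_sound_events(captions))
# ['door creaks', 'thunder rumbling', 'dog barking', 'phone ringing']
```

Mined strings like these would then be normalized into the sound/source/quality taxonomy described above.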
LanSER: Language-Model Supported Speech Emotion Recognition
Speech emotion recognition (SER) models typically rely on costly
human-labeled data for training, making scaling methods to large speech
datasets and nuanced emotion taxonomies difficult. We present LanSER, a method
that enables the use of unlabeled data by inferring weak emotion labels with
pre-trained large language models for weakly-supervised learning. To infer
weak labels constrained to a taxonomy, we use a textual entailment approach
that selects the emotion label with the highest entailment score for a speech
transcript extracted via automatic speech recognition. Our experimental
results show that models pre-trained on large datasets with this weak
supervision outperform other baseline models on standard SER datasets when
fine-tuned, and show improved label efficiency. Despite being pre-trained on
labels derived only from text, we show that the resulting representations
appear to model the prosodic content of speech.
Comment: Presented at INTERSPEECH 202
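The label-inference step amounts to scoring each taxonomy label as a hypothesis against the ASR transcript and keeping the argmax. A real system would use a pre-trained NLI model for the entailment score; the `entailment_score` below is a deliberately toy lexical stand-in, and the taxonomy and cue words are invented for illustration.

```python
TAXONOMY = ["anger", "joy", "sadness", "fear"]

CUES = {  # toy cue words standing in for an NLI model's judgement
    "anger": {"furious", "hate", "shut"},
    "joy": {"great", "love", "wonderful"},
    "sadness": {"miss", "lost", "sorry"},
    "fear": {"scared", "afraid", "help"},
}

def entailment_score(transcript, label):
    """Toy stand-in for an NLI entailment probability."""
    words = set(transcript.lower().split())
    return len(words & CUES[label]) / len(CUES[label])

def weak_label(transcript, taxonomy=TAXONOMY):
    """Pick the taxonomy label whose 'hypothesis' is most entailed
    by the transcript."""
    return max(taxonomy, key=lambda lab: entailment_score(transcript, lab))

print(weak_label("I am so scared, please help me"))  # fear
```

Swapping the toy scorer for a genuine entailment model is what makes the weak labels usable at scale for pre-training.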
Representation of professions in entertainment media: Insights into frequency and sentiment trends through computational text analysis.
Societal ideas and trends dictate media narratives and cinematic depictions, which in turn influence people's beliefs and perceptions of the real world. Media portrayals of individuals and social institutions related to culture, education, government, religion, and family affect their function and evolution over time, as people perceive and incorporate the representations from these portrayals into their everyday lives. It is important to study media depictions of social structures to ensure they do not propagate or reinforce negative stereotypes or discriminate against particular sections of society. In this work, we examine media representation of different professions and provide computational insights into their incidence, and the sentiment expressed toward them, in entertainment media content. We create a searchable taxonomy of professional groups, synsets, and titles to facilitate their retrieval from short-context, speaker-agnostic text passages such as movie and television (TV) show subtitles. We leverage this taxonomy and relevant natural language processing models to create a corpus of professional mentions in media content, spanning more than 136,000 IMDb titles over seven decades (1950-2017). We analyze the frequency and sentiment trends of different occupations, study the effect of media attributes such as genre, country of production, and title type on these trends, and investigate whether the incidence of professions in media subtitles correlates with their real-world employment statistics. We observe increased media mentions over time of STEM, arts, sports, and entertainment occupations in the analyzed subtitles, and a decreased frequency of manual labor and military occupations. The sentiment expressed toward lawyers, police, and doctors shows an increasingly negative trend over time, whereas mentions of astronauts, musicians, singers, and engineers trend more favorably. We found that genre is a good predictor of the type of professions mentioned in movies and TV shows, and that professions that employ more people showed increased media frequency.
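The retrieval step can be illustrated as taxonomy matching over subtitle text. The groups, terms, tokenization, and example lines below are toy assumptions for the sketch, not the paper's actual taxonomy or NLP models.

```python
from collections import Counter

# Toy profession taxonomy: group -> mention terms (titles/synonyms).
TAXONOMY = {
    "medicine": {"doctor", "nurse", "surgeon"},
    "law": {"lawyer", "judge", "attorney"},
    "stem": {"engineer", "scientist", "astronaut"},
}

def count_mentions(subtitle_lines, taxonomy=TAXONOMY):
    """Count profession-group mentions in speaker-agnostic subtitle text."""
    counts = Counter()
    for line in subtitle_lines:
        tokens = set(line.lower().replace(".", "").replace(",", "").split())
        for group, terms in taxonomy.items():
            counts[group] += len(tokens & terms)
    return counts

lines = [
    "Get a lawyer, a good one.",
    "The doctor said she'll recover.",
    "My father was an engineer, like his father.",
]
print(dict(count_mentions(lines)))
```

Aggregating such counts per title and year, together with a sentiment model over each mention's context, is what yields the frequency and sentiment trends analyzed above.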
Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data
Large-scale databases with high-quality manual annotations are scarce in the
audio domain. We thus explore a self-supervised graph approach to learning
audio representations from highly limited labelled data. Considering each audio
sample as a graph node, we propose a subgraph-based framework with novel
self-supervision tasks that can learn effective audio representations. During
training, subgraphs are constructed by sampling the entire pool of available
training data to exploit the relationship between the labelled and unlabeled
audio samples. During inference, we use random edges to alleviate the overhead
of graph construction. We evaluate our model on three benchmark audio
databases, and two tasks: acoustic event detection and speech emotion
recognition. Our semi-supervised model performs better or on par with fully
supervised models and outperforms several competitive existing models. Our
model is compact (240k parameters) and produces generalized audio
representations that are robust to different types of signal noise.
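A subgraph-sampling step of this kind might look like the sketch below, which draws a mix of labelled and unlabelled nodes from the training pool and wires each node to its nearest neighbours by cosine similarity. The pool size, the mix ratio, and the k-nearest-neighbour edge rule are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool: 100 audio clips as graph nodes, 20 of them labelled.
n_nodes, n_labelled = 100, 20
features = rng.normal(size=(n_nodes, 32))
labelled = np.arange(n_labelled)
unlabelled = np.arange(n_labelled, n_nodes)

def sample_subgraph(size=16, frac_labelled=0.5):
    """Draw a training subgraph mixing labelled and unlabelled nodes,
    connecting each node to its 3 nearest neighbours by cosine similarity."""
    k_lab = int(size * frac_labelled)
    nodes = np.concatenate([
        rng.choice(labelled, size=k_lab, replace=False),
        rng.choice(unlabelled, size=size - k_lab, replace=False),
    ])
    X = features[nodes]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)        # no self-edges
    nbrs = np.argsort(sim, axis=1)[:, -3:]  # top-3 neighbours per node
    adj = np.zeros((size, size), dtype=int)
    for i, js in enumerate(nbrs):
        adj[i, js] = 1
    return nodes, adj

nodes, adj = sample_subgraph()
print(adj.shape, adj.sum(axis=1))  # (16, 16), three outgoing edges per node
```

At inference time the paper replaces this neighbour search with random edges, which avoids the similarity computation entirely.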